Distributed Computing Using Spark
Supervisor: | Fang Wei-Kleiner, Geb. 051, Raum 01-023 |
---|---|
Language: | Exercise sheets will be written in English. The meetings with the tutor will be in English. |
Study: | Master |
Schedule
All resources can be found in ILIAS under this lab course. The kickoff meeting takes place on May 5, 2021 at 10 AM. Zoom info: Meeting ID: 691 6845 5535 Passcode: v98FNm0s0
Pre-requisites:
Programming in Python (basic knowledge). Attendance in the lecture 'Data Analysis and Query Languages' is highly recommended.
Experiment sheets
This course is based on practical experiment sheets that have to be solved individually or in small groups of two students. The submitted solutions will be marked and discussed with the tutor (compulsory attendance).
Content
The Web has undergone significant changes over the last decade. Currently, in the so-called Web 2.0, users are able to publish their own content, collaborate, discuss, and form online communities.
The continuously growing content brings with it new challenges to the current paradigms for combining data, content, and services from multiple sources.
Therefore, distributed storage and processing are sometimes unavoidable when it comes to creating personalized experiences and applications.
Apache Spark brought a revolution to the big data space.
In fact, it has overtaken Hadoop, an open-source, distributed, Java-based framework, which consists of the Hadoop Distributed File System (HDFS) and MapReduce, its execution engine.
Spark is nowadays the most active open source Big Data project. It is similar to Hadoop in that it's a distributed, general-purpose computing platform.
However, because Spark is able to keep large amounts of data in memory, Spark programs can be executed up to 100 times faster than their MapReduce counterparts.
The purpose of this practical is to learn the capabilities of Spark in an incremental fashion. The developed application will be a Recommender System,
i.e. a system which provides useful suggestions for users.
The student will implement different kinds of recommendation algorithms in a distributed fashion while covering Spark's main capabilities:
MapReduce, batch programming, real-time data-processing functions, SQL-like handling of structured data, graph algorithms, and machine learning.
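To give a feel for the MapReduce style of a recommender, here is a minimal sketch in plain Python (no Spark required). It is an illustration only, not the algorithm used in the experiment sheets: the toy `ratings` data, the function names, and the choice of item co-occurrence counting are all assumptions made for this example. In Spark, the map and reduce steps would run in parallel across a cluster instead of in a single process.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (user, item) interactions.
ratings = [
    ("alice", "A"), ("alice", "B"),
    ("bob", "A"), ("bob", "B"), ("bob", "C"),
    ("carol", "B"), ("carol", "C"),
]


def group_by_user(pairs):
    """Collect the set of items each user interacted with."""
    grouped = defaultdict(set)
    for user, item in pairs:
        grouped[user].add(item)
    return grouped


def map_phase(user_items):
    """Emit ((item_a, item_b), 1) for each ordered pair of items sharing a user."""
    for items in user_items.values():
        for a, b in combinations(sorted(items), 2):
            yield (a, b), 1
            yield (b, a), 1


def reduce_phase(pairs):
    """Sum the emitted counts per key, as a reduce-by-key step would."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts


def recommend(user, pairs, top_n=3):
    """Rank unseen items by how often they co-occur with the user's items."""
    user_items = group_by_user(pairs)
    cooccurrence = reduce_phase(map_phase(user_items))
    seen = user_items.get(user, set())
    scores = defaultdict(int)
    for (a, b), count in cooccurrence.items():
        if a in seen and b not in seen:
            scores[b] += count
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


print(recommend("alice", ratings))
```

Splitting the work into a map step that emits key-value pairs and a reduce step that aggregates per key is what makes the computation easy to distribute: each user's item set can be processed independently, and only the per-key sums need to be combined.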
Material
1. Spark in Action. Petar Zečević and Marko Bonaći.
2. Machine Learning with Spark. Nick Pentreath.
3. Recommender Systems Handbook - 2nd ed. 2015. Francesco Ricci, Lior Rokach, Bracha Shapira.
Note: access within the university's network.
Used technologies
Docker
Jupyter Notebook
Python
Apache Spark
PySpark (Python API for Spark)